The goal of this practical session is to use genetic markers to predict the geographical origin of a set of Native American individuals from South, Central, and North America. We propose to build two linear regression models that predict the latitude and the longitude of an individual from their genetic markers. Because the number of markers (\(p = 5709\)) is larger than the number of samples (\(N = 494\)), the markers cannot be used directly as predictors in an ordinary least-squares fit. Instead, the predictors of the regression models will be the outputs of a principal component analysis (PCA) performed on the genetic markers. A genetic marker is encoded as 1 if the individual carries the mutation and 0 otherwise.
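To make the overall strategy concrete, here is a minimal sketch in R on synthetic data. It is not the solution to the exercises. The sizes, the random marker matrix, the simulated coordinates, and the choice of \(k = 5\) components are all illustrative assumptions. Only \texttt{prcomp} and \texttt{lm} from base R are used.

```r
# Synthetic stand-in for the real data: sizes and values are illustrative only.
set.seed(1)
n <- 50    # number of individuals (494 in the real data)
p <- 200   # number of binary markers (5709 in the real data)
markers <- matrix(rbinom(n * p, size = 1, prob = 0.3), nrow = n, ncol = p)
lat  <- rnorm(n)   # hypothetical latitudes
long <- rnorm(n)   # hypothetical longitudes

# PCA on the marker matrix (columns centered, not scaled).
pca <- prcomp(markers, center = TRUE, scale. = FALSE)

# Keep the scores on the first k axes as predictors (k = 5 is arbitrary here).
k <- 5
scores <- as.data.frame(pca$x[, 1:k])

# Two separate linear models, one per coordinate.
fit_lat  <- lm(lat ~ ., data = scores)
fit_long <- lm(long ~ ., data = scores)

# Each model has an intercept plus k slopes.
length(coef(fit_lat))
```

Since the coordinates here are pure noise, the fitted models carry no signal. The point is only the pipeline: markers, then principal component scores, then one regression per coordinate.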
\section*{$\blacktriangleright$~Exercise 1: Data}
Download the dataset from Chamilo. Each row corresponds to an individual, and the columns have explicit names. The third column contains the name of the tribe to which each individual belongs. Columns 7 and 8 contain the latitude and the longitude, and the genetic markers start at Column 9.
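The column layout described above can be illustrated on a small toy data frame. This is only a sketch. Apart from \texttt{Pop}, every column name and every value is hypothetical and chosen to mimic the described positions (tribe in column 3, coordinates in columns 7 and 8, markers from column 9 onwards).

```r
# Toy data frame mimicking the described layout; names other than Pop are hypothetical.
toy <- data.frame(
  id1 = 1:4, id2 = 1:4,
  Pop = c("A", "A", "B", "C"),              # column 3: tribe names
  x4 = 0, x5 = 0, x6 = 0,                   # columns 4 to 6: other metadata
  lat  = c(10, 11, -5, 40),                 # column 7
  long = c(-80, -81, -60, -100),            # column 8
  m1 = c(0, 1, 1, 0), m2 = c(1, 1, 0, 0)    # columns 9 onwards: binary markers
)

tribes  <- toy[, 3]                         # tribe of each individual
coords  <- toy[, 7:8]                       # latitude and longitude
markers <- as.matrix(toy[, 9:ncol(toy)])    # marker matrix, one row per individual

dim(markers)    # individuals x markers
table(tribes)   # number of individuals per tribe
```

Selecting by position, as here, depends on the column order. Selecting by name is more robust once the actual column names of the dataset are known.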
Describe what the code below does and how it works (you can take a look at ). You should get the same figure as the one shown below.
NAm2 = read.table("NAm2.txt", header=TRUE)
names=unique(NAm2$Pop)
npop=length(names)
NAm2